Randy Johnson
6/21/2018
Sand flea data from McDonald, J.H. 1989. Selection component analysis of the Mpi locus in the amphipod Platorchestia platensis. Heredity 62: 243-249.
See the Handbook of Biological Statistics for additional examples.
eggs = β0 + β1 mg + ε
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.6890 4.2009 3.021 0.0056 **
## mg 1.6017 0.6176 2.593 0.0154 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
eggs = 12.7 + 1.6*mg + ε
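The fitted model above comes from a call like the following. This is a minimal sketch: `sandflea` and its values are hypothetical stand-ins for the McDonald data, which pairs female dry weight (mg) with egg counts.

```r
# Hypothetical stand-in for the amphipod data (not the original values).
sandflea <- data.frame(
  mg   = c(5.6, 6.1, 6.5, 6.9, 7.4, 7.8, 8.3, 8.7),  # dry weight in mg
  eggs = c(20, 22, 24, 23, 25, 24, 27, 26)           # number of eggs
)
fit <- lm(eggs ~ mg, data = sandflea)  # eggs = b0 + b1*mg + error
summary(fit)                           # coefficient table like the one above
```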
Two ways to test the normality assumption:
##
## Shapiro-Wilk normality test
##
## data: .std.resid
## W = 0.93555, p-value = 0.08517
See my code and ?shapiro.test for more details.
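As a sketch of this approach, the Shapiro-Wilk test can be run directly on the standardized residuals of any fitted model (here the built-in `cars` dataset stands in for the sand flea data):

```r
fit <- lm(dist ~ speed, data = cars)  # stand-in model on a built-in dataset
shapiro.test(rstandard(fit))          # H0: standardized residuals are normal
```

A p-value above 0.05 (as in the output above, p = 0.085) means we fail to reject the normality assumption.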
##
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
## rstudent unadjusted p-value Bonferroni p
## 16 3.425071 0.0021295 0.059625
See my code and ?outlierTest from the car package for more details.
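`car::outlierTest()` essentially finds the largest Studentized residual and Bonferroni-adjusts its p-value. A base-R sketch of the same idea (using the built-in `cars` data as a stand-in, not the sand flea data or the `car` source code):

```r
fit <- lm(dist ~ speed, data = cars)
rs  <- rstudent(fit)                   # Studentized residuals
n   <- length(rs)
p   <- length(coef(fit))
i   <- which.max(abs(rs))              # most extreme observation
p_unadj <- 2 * pt(abs(rs[i]), df = n - p - 1, lower.tail = FALSE)
p_bonf  <- min(1, n * p_unadj)         # Bonferroni correction for n tests
```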
Our borderline-significant outlier (#16) has very little influence on the coefficients, but it does change the error terms.
* The trend line is not changed much if we exclude this data point from the analysis.
* It does shift the line slightly upward, which slightly increases the predicted values for all data points.
The 24th data point is also highlighted here. It has a large residual (though not large enough to be flagged as an outlier) and sits at the edge of the predictor distribution, which gives it increased leverage and influence.
See my code and ?influencePlot from the car package for more details.
The borderline outlier in the top right corner is larger than we would expect, but it still falls just inside the confidence envelope within which we can expect the error terms to fall.
See my code and ?qqPlot from the car package for more details.
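`car::qqPlot()` adds a confidence envelope to the familiar base-R Q-Q plot. A minimal base-R sketch (again using the built-in `cars` data as a stand-in):

```r
fit <- lm(dist ~ speed, data = cars)            # stand-in model
qq  <- qqnorm(rstandard(fit), plot.it = FALSE)  # theoretical vs. sample quantiles
# plot(qq); qqline(rstandard(fit))  # draw it; car::qqPlot() adds the envelope
```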
size = β0 + β1 length + ε
size = β0 + β1 length + β2 length² + ε
size = β0 + β1 length + ε
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.59974 19.53898 0.031 0.976
## length 0.02435 0.06316 0.386 0.706
size = β0 + β1 length + β2 length² + ε
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.634e+02 2.958e+02 -3.257 0.00625 **
## length 6.264e+00 1.913e+00 3.275 0.00603 **
## length2 -1.008e-02 3.088e-03 -3.263 0.00617 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
size = β0 + β1 length + ε
size = β0 + β1 length + β2 length² + ε
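In R the quadratic term is added with `I()` (or `poly()`) inside the model formula. A sketch on simulated data, since the original size/length data isn't shown here; the values roughly mimic the fitted coefficients above, and all names are assumptions:

```r
set.seed(42)
dat <- data.frame(length = runif(15, 250, 370))
dat$size <- -963 + 6.26 * dat$length - 0.01008 * dat$length^2 + rnorm(15, sd = 3)
fit_lin  <- lm(size ~ length, data = dat)
fit_quad <- lm(size ~ length + I(length^2), data = dat)  # I() protects ^
anova(fit_lin, fit_quad)  # F-test: does the quadratic term improve the fit?
```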
mvnExample <- data_frame(x1 = rnorm(100),
                         y1 = x1 + rnorm(100), # normal error terms
                         y2 = x1 + rt(100, 3)) # non-normal error terms
##
## Shapiro-Wilk normality test
##
## data: .std.resid
## W = 0.98544, p-value = 0.3417
##
## Shapiro-Wilk normality test
##
## data: .std.resid
## W = 0.91781, p-value = 1.079e-05
Model 1 has normal error terms.
##
## Call:
## lm(formula = y1 ~ x1, data = mvnExample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7996 -0.6230 0.1467 0.7049 1.9721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.15329 0.09750 -1.572 0.119
## x1 0.91487 0.09161 9.987 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9619 on 98 degrees of freedom
## Multiple R-squared: 0.5044, Adjusted R-squared: 0.4993
## F-statistic: 99.74 on 1 and 98 DF, p-value: < 2.2e-16
Model 2 has non-normal error terms.
##
## Call:
## lm(formula = y2 ~ x1, data = mvnExample)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4554 -1.1273 -0.1941 1.0091 8.7573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1716 0.1932 0.888 0.377
## x1 0.8026 0.1815 4.421 2.54e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.906 on 98 degrees of freedom
## Multiple R-squared: 0.1663, Adjusted R-squared: 0.1578
## F-statistic: 19.55 on 1 and 98 DF, p-value: 2.539e-05
mcol <- data_frame(x1 = rnorm(100),
                   x2 = x1 + rnorm(100), # x1 and x2 are correlated
                   x3 = rnorm(100),
                   y = x1 + x2 + x3 + rnorm(100))
# three models
model1 <- lm(y ~ x1 + x2 + x3, data = mcol) # everything
model2 <- lm(y ~ x1 + x3, data = mcol) # without x2
model3 <- lm(y ~ x2 + x3, data = mcol) # without x1
# check for multicollinearity
vif(model1)
## x1 x2 x3
## 2.271759 2.271338 1.005520
vif(model2)
## x1 x3
## 1.000791 1.000791
vif(model3)
## x2 x3
## 1.000606 1.000606
| model | x1 | x2 | x3 | R² |
|---|---|---|---|---|
| 1 | 0.94 | 1.02 | 0.98 | 0.87 |
| 2 | 1.88 | | 0.93 | 0.76 |
| 3 | | 1.59 | 1.03 | 0.81 |
Model coefficients for our three models. Notice how the coefficient for x3 doesn't change much, but the coefficients for x1 and x2, and the R² itself, change quite a bit.
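The VIF that `car::vif()` reports can be computed by hand: regress each predictor on the other predictors and take 1/(1 - R²). A sketch for x1, using a freshly simulated copy of the example data:

```r
set.seed(1)
mcol <- data.frame(x1 = rnorm(100))
mcol$x2 <- mcol$x1 + rnorm(100)  # x2 correlated with x1
mcol$x3 <- rnorm(100)
# VIF for x1: how well do the other predictors explain x1?
r2_x1  <- summary(lm(x1 ~ x2 + x3, data = mcol))$r.squared
vif_x1 <- 1 / (1 - r2_x1)        # car::vif() reports this for each predictor
```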
Adding a loess smoother highlights this periodic pattern in the data.
lm(y1 ~ x1, data = dat) %>%
  durbinWatsonTest()
## lag Autocorrelation D-W Statistic p-value
## 1 0.1940815 1.609687 0.002
## Alternative hypothesis: rho != 0
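The Durbin-Watson statistic itself is simple to compute by hand; `car::durbinWatsonTest()` adds a bootstrapped p-value on top. A base-R sketch using the built-in `cars` data as a stand-in:

```r
fit <- lm(dist ~ speed, data = cars)  # stand-in model
e   <- residuals(fit)
dw  <- sum(diff(e)^2) / sum(e^2)      # ~2 = no autocorrelation; <2 = positive
```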
y1 ~ x1
##
## Suggested power transformation: 1.030183
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.05980537 Df = 1 p = 0.8068038
y2 ~ x1
##
## Suggested power transformation: 0.1781482
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 42.36333 Df = 1 p = 7.579794e-11
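`car::ncvTest()` is a score test that regresses scaled squared residuals on the fitted values. A simplified base-R sketch of the same computation (not the `car` source code), using the built-in `cars` data:

```r
fit <- lm(dist ~ speed, data = cars)              # stand-in model
u   <- residuals(fit)^2 / mean(residuals(fit)^2)  # scaled squared residuals
aux <- lm(u ~ fitted(fit))                        # does variance track the mean?
chisq <- sum((fitted(aux) - mean(u))^2) / 2       # score statistic, 1 df
pval  <- pchisq(chisq, df = 1, lower.tail = FALSE)
```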
##
## Call:
## lm(formula = y1 ~ x1, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.86192 -0.55272 -0.04338 0.54027 2.54020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.04454 0.20709 0.215 0.82991
## x1 0.21653 0.06599 3.281 0.00122 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9762 on 198 degrees of freedom
## Multiple R-squared: 0.05158, Adjusted R-squared: 0.04679
## F-statistic: 10.77 on 1 and 198 DF, p-value: 0.00122
##
## Call:
## lm(formula = y2 ~ x1, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2229 -1.1031 -0.2710 0.8684 10.1981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3360 0.4025 0.835 0.405
## x1 0.6483 0.1282 5.055 9.76e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.897 on 198 degrees of freedom
## Multiple R-squared: 0.1143, Adjusted R-squared: 0.1098
## F-statistic: 25.56 on 1 and 198 DF, p-value: 9.755e-07
Many of the problems we encounter in linear regression stem from model choice.
The two most common transformations:
Original model: y = β0 + β1x + ε
With power transformation: y = β0 + β1x² + ε
Original model: y = β0 + β1x + ε
With power transformation: y = β0 + β1√x + ε
Original model: y = β0 + β1x + ε
With log transformation: y = β0 + β1 ln(x) + ε
Original model: y = β0 + β1x + ε
With log transformation: ln(y) = β0 + β1x + ε
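In R these transformations are applied directly in the model formula. A sketch on simulated data with multiplicative error, where log-transforming y restores normal, constant-variance errors (all names and values here are hypothetical):

```r
set.seed(7)
x <- runif(100, 1, 10)
y <- exp(0.2 + 0.3 * x + rnorm(100, sd = 0.3))  # error is multiplicative on y
fit_raw <- lm(y ~ x)       # heteroscedastic: spread grows with the mean
fit_log <- lm(log(y) ~ x)  # errors ~ normal and homoscedastic on log scale
```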
| Assumption | Cause | Consequence | Diagnosis |
|---|---|---|---|
| Linear Relationship | Incorrect model | Inaccurate predictions | car::crPlots() |
| Multivariate Normality | Incorrect model, noisy data | Incorrect statistics (p-values / CIs) | car::qqPlot(), shapiro.test(residuals) |
| No/Little Multicollinearity | Correlated variables | Unstable model coefficients | car::vif() |
| No Autocorrelation | Non-independent data | Inefficient estimators | car::durbinWatsonTest() |
| Homoscedasticity | Incorrect model, "bad" data | Incorrect statistics (p-values / CIs) | car::ncvTest(), car::spreadLevelPlot() |
Statistics for Lunch Team

* Greg Alvord
* Eckart Bindewald
* Taina Immonen
* Brian Luke
* George Nelson
* Ravi Ravichandran
* Tom Schneider